Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture

Author

  • Lukasz Debowski
Abstract

The article discusses two mutually incompatible hypotheses about the stochastic mechanism that generates texts in natural language, both of which can be related to entropy. The first, the finite energy hypothesis, assumes that texts are generated by a process with exponentially decaying probabilities. This hypothesis implies a logarithmic upper bound for maximal repetition as a function of the text length. The second, the strong Hilberg conjecture, assumes that the topological entropy grows as a power law. This hypothesis leads to a hyperlogarithmic lower bound for maximal repetition. A study of 35 written texts in German, English and French finds that the hyperlogarithmic growth of maximal repetition holds for natural language. In this way, the finite energy hypothesis is rejected, and the strong Hilberg conjecture is partly corroborated.
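The quantity measured in the abstract can be illustrated with a minimal Python sketch. It assumes that the maximal repetition of a text is the length of the longest substring occurring at least twice, with overlapping occurrences allowed; the abstract does not spell out the definition, so this reading is an assumption. The function names `has_repeat`, `maximal_repetition` and `repetition_profile` are hypothetical, and the set-based search is a simple stand-in, not the method used in the article.

```python
def has_repeat(text: str, k: int) -> bool:
    """Return True if some substring of length k occurs at least twice in text."""
    seen = set()
    for i in range(len(text) - k + 1):
        chunk = text[i:i + k]
        if chunk in seen:
            return True
        seen.add(chunk)
    return False


def maximal_repetition(text: str) -> int:
    """Length of the longest substring that occurs at least twice (overlaps allowed).

    Assumed definition, see the lead-in. If a repeat of length k exists,
    a repeat of length k - 1 exists as well, so binary search on k is valid.
    """
    lo, hi = 0, max(len(text) - 1, 0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(text, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def repetition_profile(text: str, lengths: list[int]) -> list[tuple[int, int]]:
    """Maximal repetition of each prefix text[:n], for the given prefix lengths n."""
    return [(n, maximal_repetition(text[:n])) for n in lengths]
```

Plotting the output of `repetition_profile` against the logarithm of the prefix length is one way to see whether the growth looks logarithmic or faster. For book-length inputs a suffix-array or suffix-automaton approach would be far more efficient than this sketch.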


Related articles

The Relaxed Hilberg Conjecture: A Review and New Experimental Support

The relaxed Hilberg conjecture states that the mutual information between two adjacent blocks of text in natural language grows as a power of the block length. The present paper reviews recent results concerning this conjecture. First, the relaxed Hilberg conjecture occurs when the texts repeatedly describe a random reality and Herdan’s law for facts repeatedly described in the texts is obeyed....


Hilberg’s Conjecture: an Updated FAQ

This note is a brief introduction to theoretical and experimental results concerning Hilberg’s conjecture, a hypothesis about natural language. The aim of the text is to provide a short guide to the literature. 1 What is Hilberg’s conjecture? In the early days of information theory, Shannon (1951) published estimates of conditional entropy for printed English. A few decades later, Hilberg (1990...


Hilberg’s Conjecture — a Challenge for Machine Learning

We review three mathematical developments linked with Hilberg’s conjecture—a hypothesis about the power-law growth of entropy of texts in natural language, which sets up a challenge for machine learning. First, considerations concerning maximal repetition indicate that universal codes such as the Lempel-Ziv code may fail to efficiently compress sources that satisfy Hilberg’s conjecture. Second,...


On Hilberg's Law and Its Links with Guiraud's Law

Hilberg (1990) supposed that finite-order excess entropy of a random human text is proportional to the square root of the text length. Assuming that Hilberg’s hypothesis is true, we derive Guiraud’s law, which states that the number of word types in a text is greater than proportional to the square root of the text length. Our derivation is based on some mathematical conjecture in coding theory...


On normalizers of maximal subfields of division algebras

Here, we investigate a conjecture posed by Amiri and Ariannejad claiming that if every maximal subfield of a division ring $D$ has trivial normalizer, then $D$ is commutative. Using Amitsur classification of finite subgroups of division rings, it is essentially shown that if $D$ is finite dimensional over its center then it contains a maximal subfield with non-trivial normalize...



Journal:
  • Entropy

Volume 17  Issue 

Pages  -

Publication date: 2015